Use case example

Terminal access

On Linux, you can open a terminal with Ctrl + Alt + T.

On Windows, you can search for the Command Prompt (type cmd in the Start menu, then open Command Prompt), or you can open PowerShell by pressing Shift + right-click on the desktop or in the desired folder.

Generate an SSH key pair

In order to access the DGX, you need an SSH key. To generate one, you can use ssh-keygen in a terminal. The public key will then be in ~/.ssh/filename.pub (Linux/macOS) or in C:\Users\USERNAME\.ssh\filename.pub (Windows).
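
For example, the following generates an Ed25519 key pair; the file name dgx_key is only an illustration (any name works), and you will be prompted for an optional passphrase:

ssh-keygen -t ed25519 -f ~/.ssh/dgx_key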

Connection

Once your account has been created, you can now access the DGX using SSH.

Using an SSH client on Linux/macOS

By default, you can use the following command:

user@mycomputer:~$ ssh username@hubia-dgx.centralesupelec.fr

You can also edit your ~/.ssh/config file by adding, for example:

Host dgx
    HostName hubia-dgx.centralesupelec.fr
    User username
    IdentityFile ~/.ssh/private_key_name

which will allow you to directly use the command ssh dgx.

Using an SSH client on Windows (PuTTY)

To access the DGX from a Windows machine with an SSH client, you can use PuTTY.

When asked to configure your connection, fill the "Host Name (or IP address)" field with "hubia-dgx.centralesupelec.fr" and make sure that the "Connection type" is set to "SSH".

When PuTTY is configured correctly, you just have to click the "Open" button and refer to the previous section in order to log in.

Network restrictions

Please note that the DGX can only be accessed from the Eduroam network on the Paris-Saclay campus or via VPN.

On the DGX

Once the SSH connection is established, congratulations, you are now connected to the DGX!

From there, you can ask for an interactive session or launch a batch job (see page on slurm jobs management for more information on the different options, or the examples below).

Using VSCode

The DGX is a server, and you don't have access to a graphical interface. However, now that the configuration is done, you can launch VSCode on your machine and connect to the DGX using the Remote-SSH extension.
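
As a side note, if the code command-line tool is available on your machine, the extension can also be installed from a terminal (the identifier below is the official one for the Remote-SSH extension):

code --install-extension ms-vscode-remote.remote-ssh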

Once this extension has been installed, you can click on the icon at the bottom left of VSCode, and select Remote-SSH: Connect to Host.... You can then select dgx from the list of hosts, and connect to the DGX. You will then be able to open your user folder.

Some users have encountered problems with the VSCode Remote-SSH extension. To resolve them, they had to perform the following operations: Ctrl+Shift+P, Remote SSH: Uninstall VS Code Server from Host, Remote SSH: Connect current window to host.

More information on VSCode key-based authentication can be found in this guide.

Using a Python virtual environment

The easiest way to manage your libraries is to create your own virtual environment. You can make one for all your projects, or one per project to avoid conflicts. The important things are not to install your libraries directly in the global environment, and not to forget to activate your environment before launching your code.

To create a Python virtual environment, simply type python -m venv your_env_name in your terminal, from the directory where you want to create it.

You can then activate it by typing source ./your_env_name/bin/activate. You can now install your libraries with pip install some_cool_library_I_need.

At the end of your session, you can close your venv with the command deactivate.
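
As a recap, a typical session could look like the following sketch (the environment name, library and script names are only examples):

cd ~/working_directory                    # go where you want the venv
python -m venv your_env_name              # create the venv (once)
source ./your_env_name/bin/activate       # activate it (every session)
pip install some_cool_library_I_need      # install your libraries inside it
python main.py                            # run your code
deactivate                                # leave the venv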

If you are using VSCode, in order to activate your venv, you first need to open the project and then click on the environment name in the bottom right corner of VSCode. You can then select your environment manually by going to your environment folder, then bin, and selecting python.
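
Concretely, the interpreter to select is the python binary inside your venv; for an environment created as above, the path looks like this (illustrative):

~/working_directory/your_env_name/bin/python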

Using an interactive session

With an interactive session, you can write your code and test it on small datasets running on the MIGs (Multi-Instance GPUs):

* the partition must be interactive10;
* the reserved MIG must be one 1g.10gb;
* the total CPUs requested (ntasks * cpus-per-task) must not exceed 4 CPUs.

For example, for a one-hour interactive session:

srun --partition=interactive10 --gres=gpu:1g.10gb:1 --ntasks=1 --cpus-per-task=4 --time=1:00:00 --pty bash

The max walltime, which is also the default, is two hours. Once your session starts (it can take some time if the MIGs are already in use), you can activate your virtual environment and start working on your code and your tests!

At any time, you can close the session with exit, which will also end the job and free the MIG for other users.
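
Putting it together, a short interactive session could look like the following sketch (the prompt, virtual environment and script names are illustrative, and nvidia-smi is assumed to be available to check which MIG you were given):

user@mycomputer:~$ srun --partition=interactive10 --gres=gpu:1g.10gb:1 --ntasks=1 --cpus-per-task=4 --time=1:00:00 --pty bash
username@dgxa100:~$ nvidia-smi                            # check the allocated MIG (assumed available)
username@dgxa100:~$ source ~/your_env_name/bin/activate   # example venv path
(your_env_name) username@dgxa100:~$ python main.py        # example script
(your_env_name) username@dgxa100:~$ exit                  # ends the job and frees the MIG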

Using a batch job

Now that your code is ready to run on bigger datasets, you want to use the MIG for longer computing, and/or use bigger MIGs. For that, you will use the batch job.

Writing the job script

Suppose you want to execute a main.py file. Here's a fairly general template of a job.batch script which runs main.py, including all mandatory directives (partition, gres, ntasks and cpus-per-task):

#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output /path/to/slurm-%j.out
#SBATCH --error /path/to/slurm-%j.err

## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10

## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1

## For ntasks and cpus: total requested cpus (ntasks * cpus-per-task) must be in [1: 4 * nMIG] with nMIG = VRAM / 10 (hence 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

## Perform run
python3 /path/to/main.py

Another one, using a virtual environment, a logslurm directory (for the output and error files) and a working_directory (containing the main.py) in the user's home:

#!/bin/bash
#
#SBATCH --job-name=job
#SBATCH --output=~/logslurm/slurm-%j.out
#SBATCH --error=~/logslurm/slurm-%j.err

## For partition: either prod10, prod20, prod40 or prod80
#SBATCH --partition=prod10

## For gres: either 1g.10gb:1 for prod10, 2g.20gb:1 for prod20, 3g.40gb:1 for prod40 or A100.80gb:1 for prod80.
#SBATCH --gres=gpu:1g.10gb:1

## For ntasks and cpus: total requested cpus (ntasks * cpus-per-task) must be in [1: 4 * nMIG] with nMIG = VRAM / 10 (hence 1, 2, 4, 8 for 1g.10gb, 2g.20gb, 3g.40gb, A100.80gb).
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

## Virtual environment
source ~/env/bin/activate

## Perform run
CUDA_VISIBLE_DEVICES=1 time python ~/working_directory/main.py

In both examples, standard output (stdout) will be in the slurm-%j.out file (the %j will be replaced by the job ID automatically) and the standard error (stderr) will be in the slurm-%j.err file.

Please note that the directories you specify for the output and the error files must already exist.
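
For the second example above, you can therefore create the log directory once before submitting the job:

$ mkdir -p ~/logslurm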

Submitting the job script

You need to submit your script job.batch with:

$ sbatch /path/to/job.batch
Submitted batch job 29509

which responds with the JobID assigned to the job; in this example, the JobID is 29509. The JobID is a unique identifier that is used by many Slurm commands.

Monitoring the job

The squeue command shows the list of jobs:

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
29509 prod10        job  username  R       0:02      1 dgxa100

You can change the default format through the SQUEUE_FORMAT variable, for example by adding the following to your .bash_profile:

export SQUEUE_FORMAT="%.18i %.14P %.8j %.10u %.2t %.10M %20b %R"

which replaces the NODES column (always 1, since there is only the DGX) with the MIG requested by the job (TRES_PER_NODE column):

JOBID      PARTITION     NAME       USER ST       TIME TRES_PER_NODE        NODELIST(REASON)

For more squeue format options, see the squeue manual page (man squeue).

If your job is pending for a priority reason, you can get more information about it with the sprio command. Maybe priority is being given to more occasional users (fairshare), maybe the other jobs are asking for less time (you can change the time requested for your job with the --time flag), or maybe there are simply too many jobs at the moment. But don't worry: given enough time, everyone will have their jobs completed!
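
For example, to request a specific walltime at submission time without editing your script, sbatch accepts the flag directly on the command line (30 minutes here, purely as an illustration):

$ sbatch --time=0:30:00 /path/to/job.batch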

Canceling the job

If you need to cancel your job, you can use the scancel command.

To cancel your job with jobid 29509 (obtained when submitting or through squeue), you would use:

$ scancel 29509
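
If needed, scancel can also cancel all of your own jobs at once by filtering on your user name:

$ scancel -u username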

Wrapping it up

Now you should have all the tools to start your computations on the DGX! If you haven't already done so, you can explore the rest of the documentation, in particular the available partitions or the page on slurm jobs management.

Finally, don't hesitate to contact us at dgx_support@listes.centralesupelec.fr if you have any questions about the problems you're experiencing or if you'd like to suggest additions to this documentation.

Happy computing!